Tidycensus will convince you to learn R

nicar.r-journalism.com/2024/

Andrew Ba Tran

March 9, 2024

Workshop agenda

  • nicar.r-journalism.com/2024/ (Follow along here)

  • Intro to Tidycensus and RStudio

  • Wrangling Census data with Tidyverse functions

  • Common Census queries

  • Visualizing Census data (if there’s time)

The American Community Survey, R, and tidycensus

What is the ACS?

  • Annual survey of 3.5 million US households

  • Covers more specific topics not available in decennial US Census data (e.g. income, education, language, housing characteristics)

  • Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group)

  • Data delivered as estimates characterized by margins of error

How to get ACS data

tidycensus

  • R interface to the Decennial Census, American Community Survey, Population Estimates Program, and Public Use Microdata Series APIs

  • First released in 2017; nearly 500,000 downloads from the Posit CRAN mirror

  • censusapi by data journalist Hannah Recht

    • Seeks to be an API wrapper for ALL Census products

Census data issues I

  • Groups, sub groups, sub sub groups, etc, are a pain
  • Takes forever to tidy up

Census data issues II

  • Transposing the data helps a bit but
  • Still requires a lot of work to clean up

Tidycensus: Features

  • Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested)

  • Automatically downloads and merges Census geometries to data for mapping

  • Includes tools for handling margins of error in the ACS and working with survey weights in the ACS PUMS

  • States and counties can be requested by name (no more looking up FIPS codes!)

  • Script out your process for re usability

R and RStudio

  • R: programming language and software environment for data analysis (and scraping and visualization and so much more)

  • RStudio: integrated development environment (IDE) for R developed by Posit

    • Built on top of R
    • Lets you view your data, write and save R (or Python) scripts or notebooks, and view graphical static and interactive outputs

RStudio tour

Running code in R

  • <- assignment saves to the environment/memory
  • # hashes, commented out code
    • Copy and paste code into the console to run (without the hash)
  • run code in the console at the bottom or
  • in a script, highlight the code and click the ‘run’ button at the top right
  • or put your cursor in the script on the line of code and hit ctrl+enter (or cmd + enter)

Getting started with tidycensus

  • To get started, install the packages and files for this class

  • If you are using an IRE laptop, these packages are already installed for you

install.packages(c("tidycensus", "tidyverse", "mapview", "usethis"))
usethis::use_course("https://github.com/r-journalism/nicar-2024-tidycensus/archive/master.zip")

Optional: your Census API key

  • tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day

  • Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.

  • Once activated, use the census_api_key() function to set your key as an environment variable

library(tidycensus)

census_api_key("YOUR KEY GOES HERE", install = TRUE)

Getting started with ACS data in tidycensus

open 01_tidycensus.R in RStudio

Using the get_acs() function

  • The get_acs() function is your portal to access ACS data using tidycensus

  • The two required arguments are geography and variables. The function defaults to the latest 5-year ACS (Currently 2018-2022)

library(tidycensus)

median_income <- get_acs(
  geography = "county",
  variables = "B25077_001", # median household income
  year = 2022
)
  • ACS data are returned with five columns: GEOID, NAME, variable, estimate, and moe
median_income
# A tibble: 3,222 × 5
   GEOID NAME                     variable   estimate   moe
   <chr> <chr>                    <chr>         <dbl> <dbl>
 1 01001 Autauga County, Alabama  B25077_001   191800  7996
 2 01003 Baldwin County, Alabama  B25077_001   266000  6916
 3 01005 Barbour County, Alabama  B25077_001   102700 11171
 4 01007 Bibb County, Alabama     B25077_001   120100 13377
 5 01009 Blount County, Alabama   B25077_001   159800  6189
 6 01011 Bullock County, Alabama  B25077_001    87700 20560
 7 01013 Butler County, Alabama   B25077_001    94800  5984
 8 01015 Calhoun County, Alabama  B25077_001   140500  5181
 9 01017 Chambers County, Alabama B25077_001   116900  9814
10 01019 Cherokee County, Alabama B25077_001   158700  8550
# ℹ 3,212 more rows

Exploring your data with RStudio

View(median_income)

Exporting your data

  • You saved the output of the get_acs() function to the object median_income
  • Export that dataframe object to your computer so you can use it wherever you want
library(readr)

write_csv(median_income, "whatever_filename_you_want.csv", na="")

Take your data to Excel if you want

1-year ACS data

  • 1-year ACS data are more current, but are only available for geographies of population 65,000 and greater

  • Access 1-year ACS data with the argument survey = "acs1"; defaults to "acs5"

median_value_1yr <- get_acs(
  geography = "place",
  variables = "B25077_001", # median value of homes
  year = 2022,
  survey = "acs1"
)
median_value_1yr
# A tibble: 646 × 5
   GEOID   NAME                           variable   estimate   moe
   <chr>   <chr>                          <chr>         <dbl> <dbl>
 1 0103076 Auburn city, Alabama           B25077_001   335200 22622
 2 0107000 Birmingham city, Alabama       B25077_001   125500 14964
 3 0121184 Dothan city, Alabama           B25077_001   190800  8133
 4 0135896 Hoover city, Alabama           B25077_001   393400 19743
 5 0137000 Huntsville city, Alabama       B25077_001   294700 16881
 6 0150000 Mobile city, Alabama           B25077_001   178800 11552
 7 0151000 Montgomery city, Alabama       B25077_001   155200 10868
 8 0177256 Tuscaloosa city, Alabama       B25077_001   297600 30475
 9 0203000 Anchorage municipality, Alaska B25077_001   367900 10111
10 0404720 Avondale city, Arizona         B25077_001   400300 22495
# ℹ 636 more rows

Requesting tables of variables

  • The table parameter can be used to obtain all related variables in a “table” at once
income_table <- get_acs(
  geography = "county", 
  table = "B19001", 
  year = 2022
)
income_table
# A tibble: 54,774 × 5
   GEOID NAME                    variable   estimate   moe
   <chr> <chr>                   <chr>         <dbl> <dbl>
 1 01001 Autauga County, Alabama B19001_001    22308   369
 2 01001 Autauga County, Alabama B19001_002      990   265
 3 01001 Autauga County, Alabama B19001_003      656   187
 4 01001 Autauga County, Alabama B19001_004     1026   303
 5 01001 Autauga County, Alabama B19001_005     1335   329
 6 01001 Autauga County, Alabama B19001_006      741   205
 7 01001 Autauga County, Alabama B19001_007      822   218
 8 01001 Autauga County, Alabama B19001_008      840   270
 9 01001 Autauga County, Alabama B19001_009      921   260
10 01001 Autauga County, Alabama B19001_010      962   279
# ℹ 54,764 more rows

Understanding geography and variables in tidycensus

US Census Geography

Geography in tidycensus

Geography Definition Available by Available in
"us" United States get_acs(), get_decennial()
"region" Census region get_acs(), get_decennial()
"division" Census division get_acs(), get_decennial()
"state" State or equivalent state get_acs(), get_decennial()
"county" County or equivalent state, county get_acs(), get_decennial()
"county subdivision" County subdivision state, county get_acs(), get_decennial()
"tract" Census tract state, county get_acs(), get_decennial()
"block group" OR "cbg" Census block group state, county get_acs(), get_decennial()

Querying by state

  • For geographies available below the state level, the state parameter allows you to query data for a specific state

  • For smaller geographies (Census tracts, block groups), a county argument may also need to be included

  • tidycensus translates state names and postal abbreviations internally, so you don’t need to remember the FIPS codes!

Querying tract data requires county and state

  • Example: data on median home value in San Diego County, California by Census tract
sd_value <- get_acs(
  geography = "tract", 
  variables = "B25077_001", 
  state = "CA", 
  county = "San Diego",
  year = 2022
)
sd_value
# A tibble: 737 × 5
   GEOID       NAME                                     variable estimate    moe
   <chr>       <chr>                                    <chr>       <dbl>  <dbl>
 1 06073000100 Census Tract 1; San Diego County; Calif… B25077_…  1633800  71171
 2 06073000201 Census Tract 2.01; San Diego County; Ca… B25077_…  1331000 147432
 3 06073000202 Census Tract 2.02; San Diego County; Ca… B25077_…   891100  97240
 4 06073000301 Census Tract 3.01; San Diego County; Ca… B25077_…   957500 232555
 5 06073000302 Census Tract 3.02; San Diego County; Ca… B25077_…   761700 108681
 6 06073000400 Census Tract 4; San Diego County; Calif… B25077_…   799100  94490
 7 06073000500 Census Tract 5; San Diego County; Calif… B25077_…  1025000  81768
 8 06073000600 Census Tract 6; San Diego County; Calif… B25077_…   727700  92078
 9 06073000700 Census Tract 7; San Diego County; Calif… B25077_…   736400 102788
10 06073000800 Census Tract 8; San Diego County; Calif… B25077_…   678400 119751
# ℹ 727 more rows

Searching for variables

  • To search for variables, use the load_variables() function along with a year and dataset

  • The View() function in RStudio allows for interactive browsing and filtering

vars <- load_variables(2022, "acs5")
View(vars)

Available ACS datasets in tidycensus

  • Detailed Tables

  • Data Profile (add "/profile" for variable lookup)

  • Subject Tables (add "/subject")

  • Comparison Profile (add "/cprofile")

  • Supplemental Estimates (use "acsse")

  • Migration Flows (access with get_flows())

“Tidy” or long-form data

  • The default data structure returned by tidycensus is “tidy” or long-form data, with variables by geography stacked by row
age_sex_table <- get_acs(
  geography = "state", 
  table = "B01001", 
  year = 2022,
  survey = "acs1",
)
age_sex_table
# A tibble: 2,548 × 5
   GEOID NAME    variable   estimate   moe
   <chr> <chr>   <chr>         <dbl> <dbl>
 1 01    Alabama B01001_001  5074296    NA
 2 01    Alabama B01001_002  2461248  6178
 3 01    Alabama B01001_003   146169  3134
 4 01    Alabama B01001_004   158767  6029
 5 01    Alabama B01001_005   164578  5689
 6 01    Alabama B01001_006    97834  3029
 7 01    Alabama B01001_007    70450  2897
 8 01    Alabama B01001_008    42597  4156
 9 01    Alabama B01001_009    34623  3440
10 01    Alabama B01001_010    97373  4627
# ℹ 2,538 more rows

“Wide” data

  • The argument output = "wide" spreads Census variables across the columns, returning one row per geographic unit and one column per variable
age_sex_table_wide <- get_acs(
  geography = "state", 
  table = "B01001", 
  year = 2022,
  survey = "acs1",
  output = "wide" 
)
age_sex_table_wide
# A tibble: 52 × 100
   GEOID NAME        B01001_001E B01001_001M B01001_002E B01001_002M B01001_003E
   <chr> <chr>             <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
 1 01    Alabama         5074296          NA     2461248        6178      146169
 2 02    Alaska           733583          NA      385667        2351       23043
 3 04    Arizona         7359197          NA     3678381        2695      201423
 4 05    Arkansas        3045637          NA     1504488        4216       90239
 5 06    California     39029342          NA    19536425        6410     1081904
 6 08    Colorado        5839926          NA     2960896        4278      154565
 7 09    Connecticut     3626205          NA     1776689        2237       91513
 8 10    Delaware        1018396          NA      494657        1092       27456
 9 11    District o…      671803          NA      319763         733       20038
10 12    Florida        22244823          NA    10953468        6169      563703
# ℹ 42 more rows
# ℹ 93 more variables: B01001_003M <dbl>, B01001_004E <dbl>, B01001_004M <dbl>,
#   B01001_005E <dbl>, B01001_005M <dbl>, B01001_006E <dbl>, B01001_006M <dbl>,
#   B01001_007E <dbl>, B01001_007M <dbl>, B01001_008E <dbl>, B01001_008M <dbl>,
#   B01001_009E <dbl>, B01001_009M <dbl>, B01001_010E <dbl>, B01001_010M <dbl>,
#   B01001_011E <dbl>, B01001_011M <dbl>, B01001_012E <dbl>, B01001_012M <dbl>,
#   B01001_013E <dbl>, B01001_013M <dbl>, B01001_014E <dbl>, …

Using named vectors of variables

  • Census variables can be hard to remember; using a named vector to request variables will replace the Census IDs with a custom input

  • In long form, these custom inputs will populate the variable column; in wide form, they will replace the column names

Renaming variables easily

ca_education <- get_acs(
  geography = "county",
  state = "CA",
  variables = c(percent_high_school = "DP02_0062P", 
                percent_bachelors = "DP02_0065P",
                percent_graduate = "DP02_0066P"), 
  year = 2021
)
ca_education
# A tibble: 174 × 5
   GEOID NAME                       variable            estimate   moe
   <chr> <chr>                      <chr>                  <dbl> <dbl>
 1 06001 Alameda County, California percent_high_school     16.7   0.4
 2 06001 Alameda County, California percent_bachelors       28.3   0.3
 3 06001 Alameda County, California percent_graduate        21.3   0.3
 4 06003 Alpine County, California  percent_high_school     25.7   7.5
 5 06003 Alpine County, California  percent_bachelors       20.6   7.5
 6 06003 Alpine County, California  percent_graduate        18.7   8.5
 7 06005 Amador County, California  percent_high_school     30.7   2.2
 8 06005 Amador County, California  percent_bachelors       13.6   1.8
 9 06005 Amador County, California  percent_graduate         5.9   1.1
10 06007 Butte County, California   percent_high_school     22.3   0.9
# ℹ 164 more rows

ACS data warnings

Understanding limitations of the 1-year ACS

  • The 1-year American Community Survey is only available for geographies with population 65,000 and greater. This means:
  • Only 848 of 3,221 counties are available
  • Only 646 of 31,908 cities / Census-designated places are available
  • No data for Census tracts, block groups, ZCTAs, or any other geographies that typically have populations below 65,000

Data sparsity and margins of error

  • You may encounter data issues in the 1-year ACS data that are less pronounced in the 5-year ACS. For example:
  • Values available in the 5-year ACS may not be available in the corresponding 1-year ACS tables

  • If available, they will likely have larger margins of error

  • Your job as an data journalist: balance need for certainty vs. need for recency in estimates

Tagalog speakers by state (1-year ACS)

get_acs(
  geography = "state",
  variables = "B16001_099",
  year = 2022,
  survey = "acs1"
)
# A tibble: 52 × 5
   GEOID NAME                 variable   estimate   moe
   <chr> <chr>                <chr>         <dbl> <dbl>
 1 01    Alabama              B16001_099     5222  1487
 2 02    Alaska               B16001_099       NA    NA
 3 04    Arizona              B16001_099    28522  3746
 4 05    Arkansas             B16001_099       NA    NA
 5 06    California           B16001_099   760215 22953
 6 08    Colorado             B16001_099     9417  2178
 7 09    Connecticut          B16001_099     8568  2657
 8 10    Delaware             B16001_099     1112   601
 9 11    District of Columbia B16001_099       NA    NA
10 12    Florida              B16001_099    80209  7025
# ℹ 42 more rows

Tagalog speakers by state (5-year ACS)

get_acs(
  geography = "state",
  variables = "B16001_099",
  year = 2022,
  survey = "acs5"
)
# A tibble: 52 × 5
   GEOID NAME                 variable   estimate   moe
   <chr> <chr>                <chr>         <dbl> <dbl>
 1 01    Alabama              B16001_099     3854   553
 2 02    Alaska               B16001_099    18520  1414
 3 04    Arizona              B16001_099    25913  1774
 4 05    Arkansas             B16001_099     3154   515
 5 06    California           B16001_099   772833 10558
 6 08    Colorado             B16001_099     8724   834
 7 09    Connecticut          B16001_099     8353   969
 8 10    Delaware             B16001_099     2837   668
 9 11    District of Columbia B16001_099     1250   306
10 12    Florida              B16001_099    70430  2877
# ℹ 42 more rows

Other warnings

  • Variables in the Data Profile and Subject Tables can change names over time

  • The 2022 ACS is the first to include the new Connecticut Planning Regions in the “county” geography

  • The 2020 1-year ACS was not released (and is not in tidycensus), so your time-series can break if you are using iteration to pull data

The 2020 Decennial US Census data and R

What is the decennial US Census?

  • Complete count of the US population mandated by Article 1, Sections 2 and 9 in the US Constitution

  • Directed by the US Census Bureau (US Department of Commerce); conducted every 10 years since 1790

  • Used for proportional representation / congressional redistricting

  • Limited set of questions asked about race, ethnicity, age, sex, and housing tenure

2020 US Census datasets

  • The PL 94-171 Redistricting Data
  • The Demographic and Housing Characteristics (DHC) file
  • The Demographic Profile (for pre-tabulated variables)
  • Tabulations for the 118th Congress & for Island Areas
  • The Detailed DHC-A file (with very detailed racial & ethnic categories)

2020 US Census in Tidycensus

  • The get_decennial() function is used to acquire data from the decennial US Census

  • The two required arguments are geography and variables for the functions to work; for 2020 Census data, use year = 2020.

pop20 <- get_decennial(
  geography = "state",
  variables = "P1_001N",
  year = 2020
)
  • Decennial Census data are returned with four columns: GEOID, NAME, variable, and value
pop20
# A tibble: 52 × 4
   GEOID NAME                 variable    value
   <chr> <chr>                <chr>       <dbl>
 1 42    Pennsylvania         P1_001N  13002700
 2 06    California           P1_001N  39538223
 3 54    West Virginia        P1_001N   1793716
 4 49    Utah                 P1_001N   3271616
 5 36    New York             P1_001N  20201249
 6 11    District of Columbia P1_001N    689545
 7 02    Alaska               P1_001N    733391
 8 12    Florida              P1_001N  21538187
 9 45    South Carolina       P1_001N   5118425
10 38    North Dakota         P1_001N    779094
# ℹ 42 more rows

Differential privacy

  • When we run get_decennial() for the 2020 Census for the first time, we see the following messages:
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data summary file
Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
This message is displayed once per session.

What is differential privacy?

  • The Census Bureau is using differential privacy in an attempt to preserve respondent confidentiality in the 2020 Census data, which is required under US Code Title 13

  • Intentional errors are introduced into data, impacting the accuracy of small area counts (e.g. some blocks with children, but no adults)

  • Advocates argue that differential privacy is necessary to satisfy Title 13 requirements given modern database reconstruction technologies; critics contend that the method makes data less useful with no tangible privacy benefit

Wrangling and analyzing Census data

open 02_wrangling_census_data.R in RStudio

Tidycensus functions

The basics to wrangle data

  • filter() gets rid of rows
  • mutate() adds columns to the dataframe
  • group_by() and summarize() will aggregate the data by groups
  • arrange() will sort the data
  • select() will help narrow down columns
  • Daisy chain all these functions together with |>

Case study: Racial plurality by county

View(vars) # and search for Hispanic or Latino Origin by Race

Download race Census data

county_diversity <- get_acs(geography = "county",
                            variables = c("B03002_001", # total
                                          "B03002_003", # white alone
                                          "B03002_004", # black alone
                                          "B03002_005", # native american
                                          "B03002_006", # asian alone
                                          "B03002_007", # pi alone
                                          "B03002_012" # hispanic or latino
                            ),
                            survey="acs5",
                            year=2022)
county_diversity
# A tibble: 22,554 × 5
   GEOID NAME                    variable   estimate   moe
   <chr> <chr>                   <chr>         <dbl> <dbl>
 1 01001 Autauga County, Alabama B03002_001    58761    NA
 2 01001 Autauga County, Alabama B03002_003    42635   224
 3 01001 Autauga County, Alabama B03002_004    11496   525
 4 01001 Autauga County, Alabama B03002_005       59    71
 5 01001 Autauga County, Alabama B03002_006      636   213
 6 01001 Autauga County, Alabama B03002_007        0    30
 7 01001 Autauga County, Alabama B03002_012     1864    NA
 8 01003 Baldwin County, Alabama B03002_001   233420    NA
 9 01003 Baldwin County, Alabama B03002_003   192161  1020
10 01003 Baldwin County, Alabama B03002_004    19318   730
# ℹ 22,544 more rows

Add a total population column

  • With an argument summary_var
county_diversity <- get_acs(geography = "county",
                            variables = c("B03002_003", # white alone
                                          "B03002_004", # black alone
                                          "B03002_005", # native american
                                          "B03002_006", # asian alone
                                          "B03002_007", # pi alone
                                          "B03002_012" # hispanic or latino
                            ),
                            summary_var = "B03002_001", # total population
                            survey="acs5",
                            year=2022)
county_diversity
# A tibble: 19,332 × 7
   GEOID NAME                    variable estimate   moe summary_est summary_moe
   <chr> <chr>                   <chr>       <dbl> <dbl>       <dbl>       <dbl>
 1 01001 Autauga County, Alabama B03002_…    42635   224       58761          NA
 2 01001 Autauga County, Alabama B03002_…    11496   525       58761          NA
 3 01001 Autauga County, Alabama B03002_…       59    71       58761          NA
 4 01001 Autauga County, Alabama B03002_…      636   213       58761          NA
 5 01001 Autauga County, Alabama B03002_…        0    30       58761          NA
 6 01001 Autauga County, Alabama B03002_…     1864    NA       58761          NA
 7 01003 Baldwin County, Alabama B03002_…   192161  1020      233420          NA
 8 01003 Baldwin County, Alabama B03002_…    19318   730      233420          NA
 9 01003 Baldwin County, Alabama B03002_…      512   187      233420          NA
10 01003 Baldwin County, Alabama B03002_…     2046   347      233420          NA
# ℹ 19,322 more rows

Add a percent column

  • Using the dplyr library of data wrangling functions
  • mutate() to add a new column to the data frame
library(dplyr)

county_diversity <- county_diversity |>
  mutate(percent=estimate/summary_est*100)
county_diversity
# A tibble: 19,332 × 7
   GEOID NAME                    variable   estimate   moe summary_est percent
   <chr> <chr>                   <chr>         <dbl> <dbl>       <dbl>   <dbl>
 1 01001 Autauga County, Alabama B03002_003    42635   224       58761  72.6  
 2 01001 Autauga County, Alabama B03002_004    11496   525       58761  19.6  
 3 01001 Autauga County, Alabama B03002_005       59    71       58761   0.100
 4 01001 Autauga County, Alabama B03002_006      636   213       58761   1.08 
 5 01001 Autauga County, Alabama B03002_007        0    30       58761   0    
 6 01001 Autauga County, Alabama B03002_012     1864    NA       58761   3.17 
 7 01003 Baldwin County, Alabama B03002_003   192161  1020      233420  82.3  
 8 01003 Baldwin County, Alabama B03002_004    19318   730      233420   8.28 
 9 01003 Baldwin County, Alabama B03002_005      512   187      233420   0.219
10 01003 Baldwin County, Alabama B03002_006     2046   347      233420   0.877
# ℹ 19,322 more rows

Add better variable names

  • case_when() to refactor values (within mutate())
  • .default is else or if none of the factors match
  • |> are the new pipes, aka “and then”
county_diversity_race <- county_diversity |>
  mutate(race=case_when(
    variable=="B03002_003" ~"White",
    variable=="B03002_004" ~"Black",
    variable=="B03002_005" ~"Native American",
    variable=="B03002_006" ~"Asian",
    variable=="B03002_007" ~"Pacific Islander",
    variable=="B03002_012" ~"Hispanic",
    .default = "Other"
  ))
county_diversity_race
# A tibble: 19,332 × 6
   GEOID NAME                    estimate summary_est percent race            
   <chr> <chr>                      <dbl>       <dbl>   <dbl> <chr>           
 1 01001 Autauga County, Alabama    42635       58761  72.6   White           
 2 01001 Autauga County, Alabama    11496       58761  19.6   Black           
 3 01001 Autauga County, Alabama       59       58761   0.100 Native American 
 4 01001 Autauga County, Alabama      636       58761   1.08  Asian           
 5 01001 Autauga County, Alabama        0       58761   0     Pacific Islander
 6 01001 Autauga County, Alabama     1864       58761   3.17  Hispanic        
 7 01003 Baldwin County, Alabama   192161      233420  82.3   White           
 8 01003 Baldwin County, Alabama    19318      233420   8.28  Black           
 9 01003 Baldwin County, Alabama      512      233420   0.219 Native American 
10 01003 Baldwin County, Alabama     2046      233420   0.877 Asian           
# ℹ 19,322 more rows

Group up some smaller groups

  • use group_by() to group up things
  • use summarize() to do something (usually math) on these groups
  • Let’s combine the population for Asian and Pacific Islander

Group up some smaller groups code

county_diversity_percent <- county_diversity |>
  mutate(race=case_when(
    variable=="B03002_003" ~"White",
    variable=="B03002_004" ~"Black",
    variable=="B03002_005" ~"Native American",
    variable=="B03002_006" ~"Asian Pacific Islander",
    variable=="B03002_007" ~"Asian Pacific Islander",
    variable=="B03002_012" ~"Hispanic",
    .default = "Other"
  )) |>
  group_by(GEOID, NAME, race) |>
  summarize(estimate=sum(estimate, na.rm=T),
            summary_est=mean(summary_est, na.rm=T)) |>
  mutate(percent=estimate/summary_est*100)
county_diversity_percent
# A tibble: 16,110 × 5
   NAME                    race                   estimate summary_est percent
   <chr>                   <chr>                     <dbl>       <dbl>   <dbl>
 1 Autauga County, Alabama Asian Pacific Islander      636       58761   1.08 
 2 Autauga County, Alabama Black                     11496       58761  19.6  
 3 Autauga County, Alabama Hispanic                   1864       58761   3.17 
 4 Autauga County, Alabama Native American              59       58761   0.100
 5 Autauga County, Alabama White                     42635       58761  72.6  
 6 Baldwin County, Alabama Asian Pacific Islander     2068      233420   0.886
 7 Baldwin County, Alabama Black                     19318      233420   8.28 
 8 Baldwin County, Alabama Hispanic                  11210      233420   4.80 
 9 Baldwin County, Alabama Native American             512      233420   0.219
10 Baldwin County, Alabama White                    192161      233420  82.3  
# ℹ 16,100 more rows

Sort the data frame low to high

  • Use the arrange() function
county_diversity_percent |>
  group_by(NAME) |>
  arrange(percent)
# A tibble: 16,110 × 4
   NAME                              race                   estimate percent
   <chr>                             <chr>                     <dbl>   <dbl>
 1 Bullock County, Alabama           Native American               0       0
 2 Clay County, Alabama              Native American               0       0
 3 Coosa County, Alabama             Asian Pacific Islander        0       0
 4 Greene County, Alabama            Asian Pacific Islander        0       0
 5 Lowndes County, Alabama           Asian Pacific Islander        0       0
 6 Lowndes County, Alabama           Native American               0       0
 7 Perry County, Alabama             Asian Pacific Islander        0       0
 8 Haines Borough, Alaska            Black                         0       0
 9 Hoonah-Angoon Census Area, Alaska Black                         0       0
10 Ashley County, Arkansas           Native American               0       0
# ℹ 16,100 more rows

Sort the data frame high to low

  • Use the arrange() function
  • Use the desc() function
county_diversity_percent_sorted <- county_diversity_percent |>
  group_by(NAME) |>
  arrange(desc(percent))
county_diversity_percent_sorted
# A tibble: 16,110 × 5
   NAME                                race     estimate summary_est percent
   <chr>                               <chr>       <dbl>       <dbl>   <dbl>
 1 Blaine County, Nebraska             White         384         384   100  
 2 Patillas Municipio, Puerto Rico     Hispanic    15927       15927   100  
 3 Naranjito Municipio, Puerto Rico    Hispanic    29142       29153   100. 
 4 Barranquitas Municipio, Puerto Rico Hispanic    28864       28899    99.9
 5 Canóvanas Municipio, Puerto Rico    Hispanic    42164       42218    99.9
 6 Comerío Municipio, Puerto Rico      Hispanic    18792       18817    99.9
 7 Aibonito Municipio, Puerto Rico     Hispanic    24515       24555    99.8
 8 Juana Díaz Municipio, Puerto Rico   Hispanic    46247       46333    99.8
 9 Yabucoa Municipio, Puerto Rico      Hispanic    30252       30313    99.8
10 Loíza Municipio, Puerto Rico        Hispanic    23525       23580    99.8
# ℹ 16,100 more rows

Notice there are 16,110 rows…

Narrow down the rows

  • We want one row for every county
  • Use the filter() function
county_diversity_percent_plurality <-
  county_diversity_percent |>
  group_by(NAME) |>
  arrange(desc(percent)) |>
  filter(row_number()==1)
county_diversity_percent_plurality
# A tibble: 3,222 × 5
   NAME                                race     estimate summary_est percent
   <chr>                               <chr>       <dbl>       <dbl>   <dbl>
 1 Blaine County, Nebraska             White         384         384   100  
 2 Patillas Municipio, Puerto Rico     Hispanic    15927       15927   100  
 3 Naranjito Municipio, Puerto Rico    Hispanic    29142       29153   100. 
 4 Barranquitas Municipio, Puerto Rico Hispanic    28864       28899    99.9
 5 Canóvanas Municipio, Puerto Rico    Hispanic    42164       42218    99.9
 6 Comerío Municipio, Puerto Rico      Hispanic    18792       18817    99.9
 7 Aibonito Municipio, Puerto Rico     Hispanic    24515       24555    99.8
 8 Juana Díaz Municipio, Puerto Rico   Hispanic    46247       46333    99.8
 9 Yabucoa Municipio, Puerto Rico      Hispanic    30252       30313    99.8
10 Loíza Municipio, Puerto Rico        Hispanic    23525       23580    99.8
# ℹ 3,212 more rows

Now there are 3,222 rows.

Which lines up with the county count in the U.S.

Narrow down the rows II

  • Use the slice() function
county_diversity_percent_plurality <-
  county_diversity_percent |>
  group_by(NAME) |>
  arrange(desc(percent)) |>
  slice(1)

Case study: Evictions in San Diego

sd_evictions <- read_csv("san_diego_evictions.csv")

sd_evictions
# A tibble: 736 × 2
   GEOID       total_evictions
   <chr>                 <dbl>
 1 06073000100               1
 2 06073000201               2
 3 06073000202               4
 4 06073000301               2
 5 06073000302              10
 6 06073000400               6
 7 06073000500               3
 8 06073000600               2
 9 06073000700               3
10 06073000800              20
# ℹ 726 more rows

Go back and modify your code

Copy and paste over the code you worked so hard on and change the geography and add state and county.

sd_tract_diversity <- get_acs(geography = "tract",
                              state = "California",
                              county = "San Diego",
                            variables = c("B03002_003", # white alone
                                          "B03002_004", # black alone
                                          "B03002_005", # native american
                                          "B03002_006", # asian alone
                                          "B03002_007", # pi alone
                                          "B03002_012" # hispanic or latino
                            ),
                            summary_var = "B03002_001", # total population
                            survey="acs5",
                            year=2022)

Wrangle the census tract data

Nothing changes except the names of the data frames

sd_tract_diversity_plurality <- sd_tract_diversity |> 
  mutate(race=case_when(
      variable=="B03002_003" ~"White",
      variable=="B03002_004" ~"Black",
      variable=="B03002_005" ~"Native American",
      variable=="B03002_006" ~"Asian Pacific Islander",
      variable=="B03002_007" ~"Asian Pacific Islander",
      variable=="B03002_012" ~"Hispanic",
      .default = "Other"
    )) |>
  group_by(GEOID, NAME, race) |>
  summarize(estimate=sum(estimate, na.rm=T),
            summary_est=mean(summary_est, na.rm=T)) |>
  mutate(percent=estimate/summary_est*100) |>
  group_by(GEOID, NAME) |>
  arrange(GEOID, NAME, desc(percent)) |>
  slice(1)
sd_tract_diversity_plurality
# A tibble: 737 × 5
   GEOID       race  estimate summary_est percent
   <chr>       <chr>    <dbl>       <dbl>   <dbl>
 1 06073000100 White     2116        3027    69.9
 2 06073000201 White     1926        2294    84.0
 3 06073000202 White     2873        3919    73.3
 4 06073000301 White     1331        2340    56.9
 5 06073000302 White     2182        2934    74.4
 6 06073000400 White     1984        3802    52.2
 7 06073000500 White     2002        2934    68.2
 8 06073000600 White     2044        3144    65.0
 9 06073000700 White     2934        4631    63.4
10 06073000800 White     3097        5257    58.9
# ℹ 727 more rows

Join data

  • Using inner_join() from dplyr
sd_joined <- inner_join(sd_tract_diversity_plurality, sd_evictions)

sd_joined
# A tibble: 736 × 6
   GEOID       race  estimate summary_est percent total_evictions
   <chr>       <chr>    <dbl>       <dbl>   <dbl>           <dbl>
 1 06073000100 White     2116        3027    69.9               1
 2 06073000201 White     1926        2294    84.0               2
 3 06073000202 White     2873        3919    73.3               4
 4 06073000301 White     1331        2340    56.9               2
 5 06073000302 White     2182        2934    74.4              10
 6 06073000400 White     1984        3802    52.2               6
 7 06073000500 White     2002        2934    68.2               3
 8 06073000600 White     2044        3144    65.0               2
 9 06073000700 White     2934        4631    63.4               3
10 06073000800 White     3097        5257    58.9              20
# ℹ 726 more rows

Summarize the evictions data

Now you can answer which neighborhoods in San Diego had the higher eviction rates.

sd_joined |>
  group_by(race) |>
  summarize(population=sum(summary_est, na.rm=T),
            total_evictions=sum(total_evictions, na.rm=T)) |>
  mutate(rate_of_evictions=total_evictions/population*1000) |>
  arrange(desc(rate_of_evictions))
# A tibble: 4 × 4
  race                   population total_evictions rate_of_evictions
  <chr>                       <dbl>           <dbl>             <dbl>
1 Black                          36               1             27.8 
2 Hispanic                  1123001            2121              1.89
3 White                     1968310            2816              1.43
4 Asian Pacific Islander     198354             212              1.07

Common Census queries

open 03_common_census_queries.R in RStudio

Example of iterating with loops

Here’s a basic “for loop” which includes setting the limits for the loop to 10.

for (i in 1:10) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

Explaining loops in R

  • Manually, this would have looked like print(1) then print(2) then print(3) one by one.

  • Each loop iterates the i within the established limits (1:10)

  • But this is a way to run code many times with slight variations to a value or values in the code. It all goes between the { and }.

Multiple years of Census data

  • Set up a way to append new data to the original data
big_census_data <- tibble() # creates a blank data frame

for (i in 2020:2022) {
  median_df <- get_acs( # temporary dataframe
    geography = "county",
    variables = "B25077_001", # median home values
    year = i
    ) |>
    mutate(year = i) # so we can identify which year
  
  big_census_data <- bind_rows(big_census_data, median_df) |>
    arrange(GEOID, year)
  # appends the temporary dataframe to the permanent one
}
big_census_data
# A tibble: 9,664 × 6
   GEOID NAME                    variable   estimate   moe  year
   <chr> <chr>                   <chr>         <dbl> <dbl> <int>
 1 01001 Autauga County, Alabama B25077_001   161200  6652  2020
 2 01001 Autauga County, Alabama B25077_001   164900  6871  2021
 3 01001 Autauga County, Alabama B25077_001   191800  7996  2022
 4 01003 Baldwin County, Alabama B25077_001   211600  5853  2020
 5 01003 Baldwin County, Alabama B25077_001   226600  5984  2021
 6 01003 Baldwin County, Alabama B25077_001   266000  6916  2022
 7 01005 Barbour County, Alabama B25077_001    86500  9981  2020
 8 01005 Barbour County, Alabama B25077_001    89500 11054  2021
 9 01005 Barbour County, Alabama B25077_001   102700 11171  2022
10 01007 Bibb County, Alabama    B25077_001    96400 13625  2020
# ℹ 9,654 more rows

Quickly calculate percent change

library(tidyr)

home_value_change <- big_census_data |>
  ungroup() |>
  filter(year!=2021) |>
  select(NAME, estimate, year) |>
  pivot_wider(names_from="year", values_from="estimate") |>
  mutate(change=round((`2022`-`2020`)/`2020`*100,2)) |>
  arrange(desc(change))
home_value_change
# A tibble: 3,230 × 4
   NAME                         `2020` `2022` change
   <chr>                         <dbl>  <dbl>  <dbl>
 1 Jackson County, South Dakota  58600 108400   85.0
 2 Crockett County, Texas       102200 181300   77.4
 3 Real County, Texas           146900 258000   75.6
 4 De Baca County, New Mexico   130000 219200   68.6
 5 Keya Paha County, Nebraska    65000 109000   67.7
 6 Daggett County, Utah         152500 252400   65.5
 7 Stewart County, Georgia       44100  72500   64.4
 8 Gem County, Idaho            198700 324700   63.4
 9 Shackelford County, Texas    110000 178500   62.3
10 Sterling County, Texas        75000 121000   61.3
# ℹ 3,220 more rows

Looping through states to get tracts

state_names <- c("DC", "MD", "VA") # Get a list of state names or abbreviations

tract_data <- tibble()
for (i in 1:length(state_names)) {
    tract_df <- get_acs(
    geography = "tract",
    variables = "B25077_001",
    year = 2022,
    state=state_names[i] # Swap out the state name in the array
    )
  
  tract_data <- bind_rows(tract_data, tract_df)
}
tract_data
# A tibble: 3,879 × 5
   GEOID       NAME                                     variable estimate    moe
   <chr>       <chr>                                    <chr>       <dbl>  <dbl>
 1 11001000101 Census Tract 1.01; District of Columbia… B25077_…   635000 368769
 2 11001000102 Census Tract 1.02; District of Columbia… B25077_…  1382800 636468
 3 11001000201 Census Tract 2.01; District of Columbia… B25077_…       NA     NA
 4 11001000202 Census Tract 2.02; District of Columbia… B25077_…  1385700 124023
 5 11001000300 Census Tract 3; District of Columbia; D… B25077_…  1110900  40872
 6 11001000400 Census Tract 4; District of Columbia; D… B25077_…  1620800 494698
 7 11001000501 Census Tract 5.01; District of Columbia… B25077_…  1131400 180121
 8 11001000502 Census Tract 5.02; District of Columbia… B25077_…  1168900 532289
 9 11001000600 Census Tract 6; District of Columbia; D… B25077_…  1426300 221155
10 11001000702 Census Tract 7.02; District of Columbia… B25077_…   365400  22326
# ℹ 3,869 more rows

Get a list of state names and/or abbreviations

  • Pull a list of state names from the depths of R with state.name
  • Pull a list of state abbreviations from the depths of R with state.abb
  • Combine them into a dataframe and don’t forget to add in DC to make a name/abbreviation relationship file
state_names <- c(state.name, "District of Columbia")
state_abbs <- c(state.abb, "DC")

state_df <- data.frame(state_names, state_abbs)
state_df
            state_names state_abbs
1               Alabama         AL
2                Alaska         AK
3               Arizona         AZ
4              Arkansas         AR
5            California         CA
6              Colorado         CO
7           Connecticut         CT
8              Delaware         DE
9               Florida         FL
10              Georgia         GA
11               Hawaii         HI
12                Idaho         ID
13             Illinois         IL
14              Indiana         IN
15                 Iowa         IA
16               Kansas         KS
17             Kentucky         KY
18            Louisiana         LA
19                Maine         ME
20             Maryland         MD
21        Massachusetts         MA
22             Michigan         MI
23            Minnesota         MN
24          Mississippi         MS
25             Missouri         MO
26              Montana         MT
27             Nebraska         NE
28               Nevada         NV
29        New Hampshire         NH
30           New Jersey         NJ
31           New Mexico         NM
32             New York         NY
33       North Carolina         NC
34         North Dakota         ND
35                 Ohio         OH
36             Oklahoma         OK
37               Oregon         OR
38         Pennsylvania         PA
39         Rhode Island         RI
40       South Carolina         SC
41         South Dakota         SD
42            Tennessee         TN
43                Texas         TX
44                 Utah         UT
45              Vermont         VT
46             Virginia         VA
47           Washington         WA
48        West Virginia         WV
49            Wisconsin         WI
50              Wyoming         WY
51 District of Columbia         DC

Common census queries

  • Diversity scores for counties
  • Poverty quantiles for counties
  • Living alone and above the ages of 65 for counties
  • Check it out in 03_common_census_queries.R

Visualizing Census data with maps

open 04_tidycensusviz.R

“Spatial” ACS data

  • One of the best features of tidycensus is the argument geometry = TRUE, which gets you the correct Census geometries with no hassle

  • get_acs() with geometry = TRUE returns a spatial Census dataset containing simple feature geometries;

Downloading “Spatial” ACS data

  • geometry = TRUE does the hard work for you of acquiring and pre-joining spatial Census data
median_value_map <- get_acs(
  geography = "tract",
  state= "MD",
  county="Baltimore City",
  variables = "B25077_001", # median values of home
  year = 2022,
  geometry = TRUE
)
  • We get back a _simple features data frame
median_value_map
Simple feature collection with 199 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -76.71152 ymin: 39.19721 xmax: -76.52945 ymax: 39.3722
Geodetic CRS:  NAD83
First 10 features:
         GEOID estimate   moe                       geometry
1  24510151200    99400 51399 MULTIPOLYGON (((-76.66506 3...
2  24510270702   218000 17102 MULTIPOLYGON (((-76.57356 3...
3  24510271802   114800 40020 MULTIPOLYGON (((-76.68352 3...
4  24510272004   210600 76149 MULTIPOLYGON (((-76.69473 3...
5  24510090700    78700 27019 MULTIPOLYGON (((-76.60149 3...
6  24510180300   209900 18838 MULTIPOLYGON (((-76.63817 3...
7  24510270703   229600 35850 MULTIPOLYGON (((-76.56192 3...
8  24510010300   338900 21159 MULTIPOLYGON (((-76.58472 3...
9  24510060300   280700 21590 MULTIPOLYGON (((-76.58773 3...
10 24510260203   152800 11899 MULTIPOLYGON (((-76.55386 3...

Exploring Census data interactively

library(mapview)

mapview(median_value_map)

Creating a shaded map with zcol

mapview(median_value_map, zcol = "estimate")

Try all the code again in a different county

median_value_map <- get_acs(
  geography = "tract",
  state= "MD", # Changeme
  county="Baltimore County", # Change me
  variables = "B25077_001", # median values of home
  year = 2022,
  geometry = TRUE
)

mapview(median_value_map, zcol = "estimate")

Migration data

county_migration <- get_flows(
  geography = "county",
  county = "Baltimore City",
  state = "MD"
)

county_migration
# A tibble: 2,664 × 7
   GEOID1 GEOID2 FULL1_NAME               FULL2_NAME     variable estimate   moe
   <chr>  <chr>  <chr>                    <chr>          <chr>       <dbl> <dbl>
 1 24510  <NA>   Baltimore city, Maryland Africa         MOVEDIN       685   194
 2 24510  <NA>   Baltimore city, Maryland Africa         MOVEDOUT       NA    NA
 3 24510  <NA>   Baltimore city, Maryland Africa         MOVEDNET       NA    NA
 4 24510  <NA>   Baltimore city, Maryland Asia           MOVEDIN      1466   279
 5 24510  <NA>   Baltimore city, Maryland Asia           MOVEDOUT       NA    NA
 6 24510  <NA>   Baltimore city, Maryland Asia           MOVEDNET       NA    NA
 7 24510  <NA>   Baltimore city, Maryland Central Ameri… MOVEDIN       322   174
 8 24510  <NA>   Baltimore city, Maryland Central Ameri… MOVEDOUT       NA    NA
 9 24510  <NA>   Baltimore city, Maryland Central Ameri… MOVEDNET       NA    NA
10 24510  <NA>   Baltimore city, Maryland Caribbean      MOVEDIN       231    94
# ℹ 2,654 more rows

Downloading map data

county_map <- get_acs(
  geography = "county",
  variable = c("Population"="B03002_001"),
  geometry = TRUE
)

county_map
Simple feature collection with 3222 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.1467 ymin: 17.88328 xmax: 179.7785 ymax: 71.38782
Geodetic CRS:  NAD83
First 10 features:
   GEOID                           NAME   variable estimate moe
1  01069        Houston County, Alabama Population   107040  NA
2  01023        Choctaw County, Alabama Population    12669  NA
3  01005        Barbour County, Alabama Population    24877  NA
4  01107        Pickens County, Alabama Population    18925  NA
5  01033        Colbert County, Alabama Population    57270  NA
6  04012         La Paz County, Arizona Population    16681  NA
7  04001         Apache County, Arizona Population    66054  NA
8  05081  Little River County, Arkansas Population    12024  NA
9  05121      Randolph County, Arkansas Population    18619  NA
10 06037 Los Angeles County, California Population  9936690  NA
                         geometry
1  MULTIPOLYGON (((-85.71209 3...
2  MULTIPOLYGON (((-88.47323 3...
3  MULTIPOLYGON (((-85.74803 3...
4  MULTIPOLYGON (((-88.34043 3...
5  MULTIPOLYGON (((-88.13925 3...
6  MULTIPOLYGON (((-114.7312 3...
7  MULTIPOLYGON (((-110.0007 3...
8  MULTIPOLYGON (((-94.48558 3...
9  MULTIPOLYGON (((-91.40687 3...
10 MULTIPOLYGON (((-118.6044 3...

Prep the migration data

county_migration_moved <- county_migration |>
  filter(variable=="MOVEDIN") |>
  filter(!is.na(GEOID2)) |>
  select(GEOID=GEOID2, migration=estimate) 

Join the migration data with the shapefile

county_map_migration <- county_map %>%
  inner_join(county_migration_moved)

mapview(county_map_migration, zcol = "migration")

Can you map migration out?

county_migration_moved <- county_migration |>
  filter(variable=="MOVEDOUT") |>
  filter(!is.na(GEOID2)) |>
  select(GEOID=GEOID2, migration=estimate) 

county_map_migration <- county_map %>%
  inner_join(county_migration_moved)

mapview(county_map_migration, zcol = "migration")

Thank you!

  • nicar.r-journalism.com/2024/